Dyr og Data

Data Wrangling with dplyr

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-08-28

Learning objectives

In this section of the data wrangling topic you will

  • Learn the key data wrangling verbs

  • Learn how to summarise data

Readings

  • Chapter 3 Data transformation

Data wrangling with dplyr 📦

Data wrangling

Wrangling data is (or should be) the most time-consuming aspect of a study — once collected!

Wrangling defined as

engagement in a long, complicated dispute or argument

round up, herd, or take charge of (livestock)

alternative to wangle, to manage to obtain (something) by persuading or cleverly manipulating someone

Better alternative to

  • data processing
  • data munging
  • data manipulation

Data wrangling and dplyr

The dplyr package provides five key functions to solve most of the data wrangling issues you’ll face

  1. Choose observations (rows) based on values of variables (cols) — filter()
  2. Reorder the observations (rows) — arrange()
  3. Choose variables by name or other feature — select()
  4. Create new variables as functions of existing ones — mutate()
  5. Collapse many values into a single summary — summarise()

Each of these functions can be used with group_by() so that they affect each subset of data (group) in turn

These are the key data wrangling verbs

Guiding principles

All verbs work in the same way, providing a consistent workflow with known outputs

  1. the first argument to the verbs is always a data frame

  2. subsequent arguments define what you want to with the data frame

  3. the result of each verb (what is returned by the function) is always a data frame

Palmer penguins

library("palmerpenguins")
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

filter()

Subset observations based on the values or one or more of the columns

penguins |>
  filter(island == "Torgersen")
# A tibble: 52 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 42 more rows
# ℹ 2 more variables: sex <fct>, year <int>

filter()

Combine filters using boolean operators

penguins |>
  filter(sex == "female" & island == "Biscoe")
# A tibble: 80 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe           37.8          18.3               174        3400
 2 Adelie  Biscoe           35.9          19.2               189        3800
 3 Adelie  Biscoe           35.3          18.9               187        3800
 4 Adelie  Biscoe           40.5          17.9               187        3200
 5 Adelie  Biscoe           37.9          18.6               172        3150
 6 Adelie  Biscoe           39.6          17.7               186        3500
 7 Adelie  Biscoe           35            17.9               190        3450
 8 Adelie  Biscoe           34.5          18.1               187        2900
 9 Adelie  Biscoe           39            17.5               186        3550
10 Adelie  Biscoe           36.5          16.6               181        2850
# ℹ 70 more rows
# ℹ 2 more variables: sex <fct>, year <int>

filter()

Combine filters using boolean operators

penguins |>
  filter(sex == "male" | bill_length_mm > 41)
# A tibble: 265 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.3          20.6               190        3650
 3 Adelie  Torgersen           39.2          19.6               195        4675
 4 Adelie  Torgersen           42            20.2               190        4250
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           38.6          21.2               191        3800
 7 Adelie  Torgersen           34.6          21.1               198        4400
 8 Adelie  Torgersen           42.5          20.7               197        4500
 9 Adelie  Torgersen           46            21.5               194        4200
10 Adelie  Biscoe              37.7          18.7               180        3600
# ℹ 255 more rows
# ℹ 2 more variables: sex <fct>, year <int>

The pipe |>

The nature of dplyr verbs leads to a workflow where we create intermediate steps

  1. group the data using group_by()
  2. summarise the grouped data by group using summarise()

We perhaps don’t need to store the output of the first operation

The pipe operator |> allows use to compose dplyr verbs into a workflow to achieve a data wrangling outcome

arrange()

Sort the rows according to a condition or variable

penguins |>
  arrange(bill_length_mm)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.1          18.1               193        3475
 6 Adelie  Torgersen           34.4          18.4               184        3325
 7 Adelie  Biscoe              34.5          18.1               187        2900
 8 Adelie  Torgersen           34.6          21.1               198        4400
 9 Adelie  Torgersen           34.6          17.2               189        3200
10 Adelie  Biscoe              35            17.9               190        3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

desc()

Arrange in descending order using desc()

penguins |>
  arrange(desc(bill_length_mm))
# A tibble: 344 × 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo    Biscoe           59.6          17                 230        6050
 2 Chinstrap Dream            58            17.8               181        3700
 3 Gentoo    Biscoe           55.9          17                 228        5600
 4 Chinstrap Dream            55.8          19.8               207        4000
 5 Gentoo    Biscoe           55.1          16                 230        5850
 6 Gentoo    Biscoe           54.3          15.7               231        5650
 7 Chinstrap Dream            54.2          20.8               201        4300
 8 Chinstrap Dream            53.5          19.9               205        4500
 9 Gentoo    Biscoe           53.4          15.8               219        5500
10 Chinstrap Dream            52.8          20                 205        4550
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

select()

Choose which variables to view or work with

penguins |>
  select(c(species, sex, bill_length_mm, body_mass_g))
# A tibble: 344 × 4
   species sex    bill_length_mm body_mass_g
   <fct>   <fct>           <dbl>       <int>
 1 Adelie  male             39.1        3750
 2 Adelie  female           39.5        3800
 3 Adelie  female           40.3        3250
 4 Adelie  <NA>             NA            NA
 5 Adelie  female           36.7        3450
 6 Adelie  male             39.3        3650
 7 Adelie  female           38.9        3625
 8 Adelie  male             39.2        4675
 9 Adelie  <NA>             34.1        3475
10 Adelie  <NA>             42          4250
# ℹ 334 more rows

select() — many options for selecting

matches()

penguins |>
  select(matches('length'))
# A tibble: 344 × 2
   bill_length_mm flipper_length_mm
            <dbl>             <int>
 1           39.1               181
 2           39.5               186
 3           40.3               195
 4           NA                  NA
 5           36.7               193
 6           39.3               190
 7           38.9               181
 8           39.2               195
 9           34.1               193
10           42                 190
# ℹ 334 more rows

last_col()

penguins |>
  select(last_col())
# A tibble: 344 × 1
    year
   <int>
 1  2007
 2  2007
 3  2007
 4  2007
 5  2007
 6  2007
 7  2007
 8  2007
 9  2007
10  2007
# ℹ 334 more rows

select() — many options for selecting

from:to

penguins |>
  select(species:flipper_length_mm)
# A tibble: 344 × 5
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows

select() — many options for selecting

starts_with()

penguins |>
  select(starts_with("s"))
# A tibble: 344 × 2
   species sex   
   <fct>   <fct> 
 1 Adelie  male  
 2 Adelie  female
 3 Adelie  female
 4 Adelie  <NA>  
 5 Adelie  female
 6 Adelie  male  
 7 Adelie  female
 8 Adelie  male  
 9 Adelie  <NA>  
10 Adelie  <NA>  
# ℹ 334 more rows

select() — many options for selecting

Many options for choosing columns in select — see ?dplyr::select

summarise()

Summarise data to a single row

penguins |>
  summarise(
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
  )
# A tibble: 1 × 1
  avg_bill_length
            <dbl>
1            43.9

group_by()

summarise() becomes much more useful if we group data first

penguins |>
  group_by(species) |>
  summarise(
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
  )
# A tibble: 3 × 2
  species   avg_bill_length
  <fct>               <dbl>
1 Adelie               38.8
2 Chinstrap            48.8
3 Gentoo               47.5

group_by()

Can group by multiple variables; grouping applied in the order given

penguins |>
  group_by(species, sex) |>
  summarise(
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
  )
# A tibble: 8 × 3
# Groups:   species [3]
  species   sex    avg_bill_length
  <fct>     <fct>            <dbl>
1 Adelie    female            37.3
2 Adelie    male              40.4
3 Adelie    <NA>              37.8
4 Chinstrap female            46.6
5 Chinstrap male              51.1
6 Gentoo    female            45.6
7 Gentoo    male              49.5
8 Gentoo    <NA>              45.6

Multiple summaries

summarise() can add any number of relevant summaries

penguins |>
  group_by(species) |>
  summarise(
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    n = n()
  )
# A tibble: 3 × 3
  species   avg_bill_length     n
  <fct>               <dbl> <int>
1 Adelie               38.8   152
2 Chinstrap            48.8    68
3 Gentoo               47.5   124

mutate()

mutate() creates new variable or modifies existing ones

penguins |>
  group_by(species) |>
  summarise(
    sum_bill_length = sum(bill_length_mm, na.rm = TRUE),
    n = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
  ) |>
  mutate(avg_bill_length_2 = sum_bill_length / n)
# A tibble: 3 × 5
  species   sum_bill_length     n avg_bill_length avg_bill_length_2
  <fct>               <dbl> <int>           <dbl>             <dbl>
1 Adelie              5858.   152            38.8              38.5
2 Chinstrap           3321.    68            48.8              48.8
3 Gentoo              5843.   124            47.5              47.1

Why don’t the two averages match?

mutate()

The issue is the missing values; let’s fix that

penguins |>
  filter(! is.na(bill_length_mm)) |>
  group_by(species) |>
  summarise(
    sum_bill_length = sum(bill_length_mm, na.rm = TRUE),
    n = n(),
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
  ) |>
  mutate(avg_bill_length_2 = sum_bill_length / n)
# A tibble: 3 × 5
  species   sum_bill_length     n avg_bill_length avg_bill_length_2
  <fct>               <dbl> <int>           <dbl>             <dbl>
1 Adelie              5858.   151            38.8              38.8
2 Chinstrap           3321.    68            48.8              48.8
3 Gentoo              5843.   123            47.5              47.5